DOCUMENT RESUME

ED 074 133                                        TM 002 496

AUTHOR        [?]oy, Mabel L. Y.; Barcikowski, Robert S.
TITLE         Item Sampling: Optimum Number of People and Items.
PUB DATE      Feb 73
NOTE          28p.; Paper presented at annual meeting of the
              National Council on Measurement in Education, AERA
              (New Orleans, La., February 25-March 1, 1973)

EDRS PRICE    MF-$0.65 HC-$3.29
DESCRIPTORS   *Evaluation Techniques; *Item Sampling; *Sampling;
              Speeches; *Standard Error of Measurement; Statistical
              Studies; Tables (Data); *Tests
IDENTIFIERS   Monte Carlo Methods

ABSTRACT

Using a computer-based Monte Carlo approach to
generate item responses, the results of this study indicate that
when item discrimination indices are considered, item-examinee
sampling procedures having the same number of observations have
different standard errors in estimating both test mean and test
variance. With certain types of tests, a single item-examinee
sampling plan would not yield optimal, i.e., smallest standard error,
estimates of both mu and sigma squared. That is, one sampling plan
would be needed to optimally estimate mu and another to optimally
estimate sigma squared. In addition, it was found that single
exhaustion of the item set was sufficient for estimating both mu and
sigma squared. (Author)

Using a computer-based Monte Carlo approach to generate
item responses, the results of this study indicate that,
when item discrimination indices are considered, item-examinee
sampling procedures having the same number of observations
have different standard errors in estimating both test
mean and test variance. With certain types of tests,
a single item-examinee sampling plan would not yield
optimal, i.e., smallest standard error, estimates of
both μ and σ². That is, one sampling plan would be needed
to optimally estimate μ and another to optimally estimate σ².
In addition, it was found that single exhaustion of the item
set was sufficient for estimating both μ and σ².



INTRODUCTION

With the need for many and continuous evaluation studies
to be performed in the service of improved instruction in our
schools, and considering the fact that there are always
limitations of time, money, and personnel with which to perform
such evaluations, it is important that procedures be used
which not only provide accurate information, but are
economical as well. Item sampling has been suggested as
just such a procedure. With item sampling, the savings,
particularly in test-taking time, can be enormous.

However, when faced with an evaluation project, how
should school personnel proceed in implementing an item sampling
procedure? What guidelines are there concerning the optimum
number of items and examinees to use in a particular situation?
These questions could be answered if information were available
on the standard error in estimating a test's mean and
variance under conditions similar to those to be encountered
in the project. Ideally, a sampling plan would be chosen
that would yield a relatively small standard error of
the mean and/or variance.

Barcikowski (1970) and Shoemaker (1971) have indicated
that the particular sampling plan chosen does make a
difference. Barcikowski, in particular, called attention to
the need to consider the range of biserial correlations between
item response and ability, i.e., item discrimination. With
reference to estimating the population mean by item sampling,
both Barcikowski's and Shoemaker's findings support the
use of a large number of subtests. Barcikowski's study,
in addition, would recommend the use of small sized subtests.
With reference to the estimation of the population variance,
Barcikowski's findings contradicted those of Shoemaker. Whereas
Shoemaker contended that item sampling plans having the
same number of observations have for all practical purposes
the same standard error in estimating the variance, Barcikowski
found that this was not true when item biserial correlations
were taken into consideration. In the case of tests with
a range of biserial correlations between .40 and .70, item sampling
plans with small sized subtests (e.g., 5 items) produced the
best estimates of test variance. In the case of tests with
a range of biserial correlations between .05 and .50, item
sampling plans with subtests somewhat less than half the size
of the whole test produced the best estimates of test variance.
He concluded that optimal estimates of both the mean and
variance from a single item sampling plan may not be possible.

Most item sampling studies have employed a sampling
plan which might be described as "single exhaustion," in
which all of the items on the whole test are used and each
item appears on only one subtest. Barcikowski employed
"multiple exhaustion," in which all the items on the whole
test are used, but the formation of subtests is continued
by replacement of the set of items each time they are
exhausted. There is a need to determine which of the two
procedures is the more advantageous.
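
The two ways of forming subtests can be illustrated with a minimal
Python sketch. It assumes that items are dealt at random without
replacement and that the pool is refilled with the full item set
whenever it runs out; the function name form_subtests and the seed
are hypothetical and are not taken from the original study.

```python
import random

def form_subtests(n_items, k, t, rng):
    """Return t subtests, each a sorted list of k item indices.

    Items are dealt without replacement from a shuffled pool. Whenever
    the pool is empty it is refilled with the full item set (a new
    "exhaustion"). With t == n_items // k the item set is exhausted
    exactly once (single exhaustion); a larger t gives multiple
    exhaustion.
    """
    if n_items % k != 0:
        raise ValueError("k must divide the number of items")
    subtests, pool = [], []
    while len(subtests) < t:
        if not pool:
            pool = list(range(n_items))
            rng.shuffle(pool)
        subtests.append(sorted(pool[:k]))
        pool = pool[k:]
    return subtests

rng = random.Random(0)  # arbitrary seed, for reproducibility only
single = form_subtests(60, 6, 10, rng)    # each item appears on one subtest
multiple = form_subtests(60, 6, 20, rng)  # each item appears on two subtests
```

Under this sketch, single exhaustion is simply the special case in
which the number of subtests equals the test length divided by the
subtest size.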

Specifically, this study was designed to provide answers
to the following questions:

1. Given tests with various ranges of biserial
correlations between item response and ability, i.e., item
discrimination, what is the optimum number of items and
examinees to be used with item-examinee sampling?

2. With item-examinee sampling, should single or multiple
exhaustion of the set of items on a test be employed?

METHOD

The procedures to generate item responses for the tests
used in this study are described by Barcikowski (1970).
Briefly, the method used was Tucker's (1946) mathematical
model of an item trace line. The item trace lines for
examinees took into account item discrimination, item difficulty,
and the ability range of the examinee, and were assumed to
be of normal ogive form. The model allows direct computation
of test population mean and variance.

Since differences in the range of item difficulty
indices did not affect Barcikowski's results, only one
range of rectangularly distributed item difficulties (.16 to
.84, centered at .50) was used. This was chosen as appropriate
for a single factor test.

Table 1 presents the ranges of biserial correlations between
item response and ability used to create seven whole tests.
Each test consisted of 60 items whose biserial correlations
were randomly selected from the respective range, and whose item
difficulties were randomly selected from the rectangularly
distributed range of .16 to .84. A 60-item test was thought
to be representative of a test from a standard achievement
battery. A people sample size of 100 was chosen so that the
probability of obtaining a sample test mean within .25 of the
population mean would be .92. With a test of 60 items and a
sample of 100 people, the total number of responses was
fixed at 6000 (60 x 100).
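
The construction of one whole test can be sketched as below. The
uniform draw for the biserial correlations is an assumption (the text
says only that they were "randomly selected"); the rectangular
distribution of difficulties is taken from the text. The function name
build_whole_test and the seed are hypothetical.

```python
import random

def build_whole_test(biserial_range, n_items=60,
                     difficulty_range=(0.16, 0.84), seed=None):
    """Return a list of item parameter dicts for one whole test."""
    rng = random.Random(seed)
    return [
        {
            # assumed uniform within the stated range
            "biserial": rng.uniform(*biserial_range),
            # rectangular (uniform) distribution, as stated in the text
            "difficulty": rng.uniform(*difficulty_range),
        }
        for _ in range(n_items)
    ]

# Example: the high-position, small-range test from Table 1.
high_test = build_whole_test((0.80, 0.90), seed=1)
```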

The total number of responses was held constant at
6000, and the total test size was held constant at 60 items.
The size of the subtests was varied over values of 6, 10,
15, 20, and 30 items. This study employed single exhaustion,
multiple exhaustion, and an extreme situation in which the
number of people taking each subtest was two. Table 2
presents the number of people that took each subtest and the
number of subtests involved for each of the above-mentioned
plans. The product of the number of subtests (t), the
subtest size (k), and the number of people that took each
subtest (n) equals the total number of observations or
responses (6000).

TABLE 1

RANGES OF BISERIAL CORRELATION BETWEEN ITEM RESPONSE AND
ABILITY USED IN CONSTRUCTION OF WHOLE TESTS

SIZE OF               POSITION OF RANGE
RANGE           LOW         MIDDLE        HIGH

  .1          .10-.20      .50-.60      .80-.90
  .5          .05-.55      .20-.70      .45-.95
  .9                       .05-.95
TABLE 2

NUMBER OF ITEMS ON EACH SUBTEST AND THE CORRESPONDING NUMBER
OF SUBTESTS AND NUMBER OF PEOPLE TAKING EACH SUBTEST
UNDER SINGLE EXHAUSTION, MULTIPLE EXHAUSTION, AND
EXTREME ITEM SAMPLING

              SINGLE              MULTIPLE             EXTREME
SUBTEST   Number    Number    Number    Number    Number    Number
SIZE      of        of        of        of        of        of
          Subtests  People    Subtests  People    Subtests  People

   6         10       100        20        50       500        2
  10          6       100        20        30       300        2
  15          4       100        20        20       200        2
  20          3       100        20        15       150        2
  30          2       100        20        10        50        2
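
The constraint that t x k x n equals 6000 can be used to derive the
number of people per subtest for a given plan. The short Python sketch
below does this for the single and multiple exhaustion columns of
Table 2; the function name people_per_subtest is hypothetical.

```python
TOTAL_RESPONSES = 6000  # held constant in the study
TEST_LENGTH = 60        # items on the whole test

def people_per_subtest(t, k, total=TOTAL_RESPONSES):
    """Solve t * k * n = total for n, the people taking each subtest."""
    n, remainder = divmod(total, t * k)
    if remainder:
        raise ValueError("t * k must divide the total number of responses")
    return n

for k in (6, 10, 15, 20, 30):
    t_single = TEST_LENGTH // k  # single exhaustion: each item used once
    n_single = people_per_subtest(t_single, k)
    n_multiple = people_per_subtest(20, k)  # multiple exhaustion: 20 subtests
    print(k, t_single, n_single, n_multiple)
```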



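One hundred estimates of test mean and variance were
acquired for each subtest size. The population means and
variances were then used to compute the sums of squared errors
$\sum_{T=1}^{100} (\hat{\mu}_T - \mu)^2$, abbreviated SSEM, and
$\sum_{T=1}^{100} (\hat{\sigma}_T^2 - \sigma^2)^2$, abbreviated SSEV.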

Traditional sampling with 6000 total number of responses
was studied by giving each of the seven whole tests to 100
people and obtaining 100 estimates of test mean and variance
per whole test. The population means and variances were then
used to compute the sums of squared errors
$\sum_{T=1}^{100} (\hat{\mu}_T - \mu)^2$ and
$\sum_{T=1}^{100} (\hat{\sigma}_T^2 - \sigma^2)^2$.

The estimates of test mean and variance were compared to
the population means and variances and to each other by the
use of these sums of squared errors. If, for example, the sum
of squared errors were 5 for one sampling plan and 10 for
another sampling plan, the ratio of SSE's would be less
than one, thus indicating the superiority of the former over
the latter sampling plan.
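
The comparison just described amounts to a few lines of arithmetic. A
minimal Python sketch, with hypothetical function names:

```python
def sum_squared_errors(estimates, parameter):
    """Sum of squared deviations of the estimates from the population value."""
    return sum((estimate - parameter) ** 2 for estimate in estimates)

def sse_ratio(sse_first, sse_second):
    """Ratio of two SSE's; a value below 1 favors the first plan."""
    return sse_first / sse_second

ratio = sse_ratio(5, 10)  # the example from the text: 5 / 10 = 0.5
```

Because both plans are judged against the same population value, the
ratio is unit-free and can be compared across tests with different
means and variances.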



RESULTS AND DISCUSSION

The actual population means and variances of the seven
whole tests used in this study are presented in Table 3.
In this table the tests are listed according to the size of
the range of biserial correlation, and within each size
according to the size of the variance associated with them.

The results of estimating test population mean for the
various sampling plans are given in Tables 4, 5, and 6. It
can be seen that as total test variance decreased, and
consequently the average value of the biserial correlations of
the tests as well, estimates of the population mean improved.
One explanation of this occurrence is found in the relationship
between the biserial correlations and test variance. As
test biserial correlations decreased, test variance also
decreased, thus reducing the standard error of the mean.
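
The link between item discrimination and the standard error can be
made explicit for the simplest case, traditional sampling of N
examinees on the whole K-item test. The notation here (p_i for the
difficulty of item i, \sigma_{ij} for the covariance between items i
and j) is introduced for illustration and does not appear in the
original; the item sampling case involves additional terms not shown.

```latex
\sigma_X^2 = \sum_{i=1}^{K} p_i (1 - p_i) + \sum_{i \neq j} \sigma_{ij},
\qquad
\mathrm{SE}(\bar{X}) = \frac{\sigma_X}{\sqrt{N}}
```

On a single factor test, lower biserial correlations imply smaller
inter-item covariances, hence a smaller test variance and a smaller
standard error of the mean.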

The data presented in Tables 4, 5, and 6 also indicate
that the better estimates of the mean are given by sampling
plans with fewer items (smaller k's) and more subtests (larger
t's). As the subtest size increased and the number of
subtests decreased, the estimates became poorer, as evidenced by
the larger SSEM's. For example, in Table 4, tests with average
biserial correlations of .80-.90 under the sampling plan of 10
subtests and 6 items per subtest (10/6/100) had a SSEM of 43.63,
whereas tests with the same average biserial correlations of
.80-.90, but under the sampling plan of 2 subtests and 30 items
per subtest (2/30/100), had a SSEM of 184.10. Traditional
sampling (k=60, t=1) supplied the poorest estimate, with a SSEM of
328.70. These results agree with the theory presented in
Lord and Novick (1968), and served as a partial check on the
model.

The data in Tables 7, 8, and 9 present comparisons of
single exhaustion, multiple exhaustion, and the extreme item
sampling plans. Table 7 compares single exhaustion to multiple
exhaustion. In the case of tests with average biserial
correlations of .80-.90, the ratios of SSEM's are 1.43, .74, .77, .97,
and 1.44 for subtest sizes of 6, 10, 15, 20, and 30, respectively.
Of these ratios, 3 out of 5 are less than 1.00 (viz., .74, .77,
and .97), thus favoring single exhaustion item sampling. A total
of 25 out of 35 ratios in Table 7 favor single exhaustion over
multiple exhaustion. Table 8 compares single exhaustion to
extreme item sampling. In the case of tests with average biserial
correlations of .45-.95, the ratios of SSEM's are 1.73, 1.01,
1.58, .86, and .84 for subtest sizes of 6, 10, 15, 20, and 30,
respectively. Of these ratios, 3 out of 5 are greater than 1.00
(viz., 1.73, 1.01, and 1.58), thus favoring the extreme plan
over single exhaustion. A total of 19 out of 35 ratios in Table 8
favor the extreme plan over single exhaustion. Table 9
compares the extreme plan to multiple exhaustion item sampling. In
the case of tests with average biserial correlations of .50-.70,
the ratios of SSEM's are 1.17, 1.43, .84, 1.11, and .86 for
subtest sizes of 6, 10, 15, 20, and 30, respectively. Of these
ratios, 3 out of 5 are greater than 1.00 (viz., 1.17, 1.43, and
1.11), thus favoring extreme item sampling. A total of 25 out
of 35 ratios in Table 9 favor the extreme plan over multiple
exhaustion.
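
The tallies quoted above can be reproduced from the ratios given in
the text. The Python sketch below counts ratios on either side of 1.00
for the two rows quoted from Tables 7 and 8; the variable and function
names are hypothetical, and only values stated in the text are used.

```python
def count_below_one(ratios):
    """Number of SSE ratios less than 1.00."""
    return sum(1 for r in ratios if r < 1.0)

def count_above_one(ratios):
    """Number of SSE ratios greater than 1.00."""
    return sum(1 for r in ratios if r > 1.0)

# Ratios for subtest sizes 6, 10, 15, 20, and 30, as quoted in the text.
table7_row = [1.43, 0.74, 0.77, 0.97, 1.44]  # single vs. multiple, .80-.90
table8_row = [1.73, 1.01, 1.58, 0.86, 0.84]  # single vs. extreme, .45-.95

favoring_single = count_below_one(table7_row)   # 3 of 5
favoring_extreme = count_above_one(table8_row)  # 3 of 5
```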

The results of estimating test population variance for
the various plans are given in Tables 10, 11, and 12.

In considering the question of optimum number of people
and items for estimating test population variance, it would seem
that for tests with higher average biserial correlations, the
optimal sampling plan would be one with small sized subtests.
Thus, in Table 10, under the sampling plan 10/6/100 (6 items
per subtest), tests with average biserial correlations of .80-.90
had a SSEV of 19399, whereas under the sampling plan 2/30/100
(30 items per subtest), the SSEV is 49603. When considering
tests with average biserial correlations in the lower ranges
of values (e.g., .10-.20, .05-.55), the optimal sampling plan
consists of subtests with larger numbers of items per subtest.
In fact, in the case of single exhaustion item sampling, the
best plan involves a subtest size one half the size of the whole
test. This can be seen in Table 10, where the smallest SSEV
is 976. This value falls under the sampling plan 2/30/100 with
30 items per subtest, which is half the size of the whole test
of 60 items. When considering tests with average biserial
correlations in the middle ranges (e.g., .50-.60, .20-.70), the
trend is not as clear, but sampling plans with small sized
subtests do tend to be better. Thus, in Table 10 tests with
average biserial correlations of .50-.60 under sampling plan
10/6/100 (k=6) had a SSEV of 9037, whereas under sampling plan
2/30/100 (k=30), the SSEV is 16442.

An examination of the data of Tables 13, 14, and 15 clearly
indicates that for estimating population variance, single
exhaustion item sampling is superior to either multiple
exhaustion or the extreme item sampling. This can be seen by the
preponderance of ratios less than 1.00. A total of 27 out of
35 ratios in Table 13 favor single exhaustion over multiple
exhaustion and a total of 23 out of 35 ratios in Table 14
favor single exhaustion over the extreme plan.

SUMMARY

The purpose of this study was to help determine optimum
item sampling plans for use in school evaluation projects.
The findings presented provided the basis for the following
conclusions.

How does the range of biserial correlations between
item response and ability, i.e., item discrimination, affect the
choice of optimum item sampling plan?

When estimating test population mean, no matter what the
values of the biserial correlations are, better estimates are
given by the item sampling plans with fewer items and more
subtests. Similar conclusions were drawn by Barcikowski (1970).
Shoemaker (1971) also recommended the use of a large number of
subtests. The results of this study support their findings.

The case is different when estimating test population variance.
Here, the optimum sampling plan depends very much on the biserial
correlation of the test in question. With biserial correlations
in the higher ranges, e.g., .80-.90 and .45-.95, employment of
subtests with fewer items produced better estimates. In the case
of tests with biserial correlations in the lower ranges, e.g.,
.10-.20 and .05-.55, just the opposite is true. Here, subtests
containing more items produced better variance estimates.
Specifically, in the case of single exhaustion, the best plan
involved subtests one half the size of the whole test. For tests
with average biserial correlations in the middle ranges, e.g.,
.50-.60 and .20-.70, the results, although not as clear-cut, also
tended to favor smaller subtests. These conclusions support the
findings in Barcikowski's study of an interaction effect between
item characteristics (biserial correlations) and sampling plan.
They, however, contradict Shoemaker's conclusion that
item-examinee sampling plans have, for all practical purposes, the
same standard error in estimating population variance. This study
indicates that such is not the case when item biserial
correlations are considered.

Should single or multiple exhaustion of the set of items
on a test be employed?

The results favor single exhaustion item sampling for the
estimation of population mean and variance for most practical
purposes. This is decidedly true in the case of population
variance estimates. In the case of population mean estimates,
extreme item sampling had a slight edge over single exhaustion
item sampling, but the difference is not sufficient to warrant
the production of the extremely large number of subtests
necessary to implement the extreme item sampling design.

RECOMMENDATIONS

For educational practitioners who desire guidelines in the
application of item sampling techniques to school evaluation
situations, the results of this study suggest the following
recommendations:

1. When the parameter of primary importance is the
population mean test score, small sized subtests should be used
(e.g., subtests of 6 items, with the data from this study),
and as many of these subtests as cost factors indicate to
be practicable.

2. When the parameter of primary importance is the
population test variance, attention should be given to the item
discrimination of the test. With tests of low item
discrimination (e.g., item biserial correlations in the range
.05-.55), larger subtests should be used, specifically,
subtests approaching half the size of the whole test. In all
other instances, smaller subtests may be used.

3. Single exhaustion of the item set is sufficient for
either population mean or variance estimates.